Abstract

Set  based  analysis  is  an  approach  to  compile-time  program  analysis  that  is  based  on  a  simple 
approximation:  all  dependencies  between  variables  are  ignored.  In  effect,  program  variables 
are  treated  as  sets  of  values.  Thus  far,  set  based  analysis  techniques  have  focussed  on  data- 
constructor  languages.  The  main  reason  for  this  is  algorithmic:  the  equality  for  data-constructor 
values is structural (that is, two values $f(v_1, \ldots, v_n)$ and $f'(v'_1, \ldots, v'_n)$ are equal if and only if
$f$ and $f'$ are identical constructors and $v_i = v'_i$, $i = 1..n$). This has important implications for
how  sets  of  such  values  can  be  represented  during  the  computation  of  a  set  based  analysis. 

In  contrast,  the  equality  theory  of  arithmetic  is  much  richer.  Two  terms  with  very  different 
structure  can  be  equal.  Correspondingly,  the  manipulation  and  representation  of  sets  of  arithmetic 
values  is  significantly  more  complex.  In  this  paper  we  extend  the  ideas  of  set  based  analysis  to 
arithmetic expressions in such a way that the analysis yields descriptions of how arithmetic values
are  computed.  Importantly,  this  extended  analysis  yields  useful  information  about  the  arithmetic 
components  of  a  program  while  maintaining  the  efficiency  of  the  basic  set  constraint  approach. 
We  show  how  this  information  can  be  exploited  during  compilation  with  two  examples  involving 
array bounds elimination. While this work is carried out in the context of ML, the techniques
developed  appear  to  be  applicable  to  other  languages. 


1  Introduction 


Set  based  analysis  [S,  7,  S]  is  an  approach  to  compile-time  program  analysis  that  is  based  on  a 
simple  notion  of  approximation:  all  dependencies  between  variables  are  ignored.  This  notion  is 
formalized  by  treating  program  variables  as  denoting  sets  of  values  instead  of  individual  values. 
For example, if at some point in a program, the environments $[x \mapsto 1, y \mapsto 2]$ and $[x \mapsto 3, y \mapsto 4]$ can
be encountered, then the set based analysis of the program will introduce set variables $\mathcal{X}$ and $\mathcal{Y}$
to represent the respective values of x and y at the given point, and $\mathcal{X}$ will contain both 1 and
3, and $\mathcal{Y}$ will contain 2 and 4. In effect, the x-y dependencies “x is 1 when y is 2” and “x
is 3 when y is 4” are ignored; instead only the sets of values for x and y are retained. Call the
approximation  that  arises  from  this  interpretation  the  program’s  set  based  approximation. 
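
The loss of dependency information can be made concrete with a minimal sketch. The function names `set_based_approximation` and `admitted` are hypothetical and are not part of the analysis itself; the sketch only collapses a list of concrete environments into one set environment and then enumerates the environments that the approximation can no longer exclude.

```python
from itertools import product


def set_based_approximation(environments):
    """Collapse concrete environments into a single set environment."""
    set_env = {}
    for env in environments:
        for var, value in env.items():
            set_env.setdefault(var, set()).add(value)
    return set_env


def admitted(set_env):
    """Enumerate every environment consistent with the set environment."""
    names = sorted(set_env)
    choices = [sorted(set_env[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]


# The two environments from the text.
envs = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
approx = set_based_approximation(envs)
spurious = [env for env in admitted(approx) if env not in envs]
```

Here `approx` maps x to {1, 3} and y to {2, 4}, and `spurious` contains the two environments (x = 1 with y = 4, and x = 3 with y = 2) that never occur but are admitted once the x-y dependency is dropped.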

Computationally,  set  based  analysis  proceeds  by  extracting  set  constraints  from  a  program 
such  that  the  least  solution  of  the  constraints  corresponds  exactly  to  the  program’s  set  based 
approximation. These constraints are then input to a solver that computes an explicit representation
of the least solution of the constraints (the constraint solving process is the main algorithmic
component  of  the  analysis). 

To date, set based analysis techniques have focussed on data-constructor languages. The
main reason for this is that two values $f(v_1, \ldots, v_n)$ and $f'(v'_1, \ldots, v'_n)$ are equal if and only if
$f$ and $f'$ are identical constructors and $v_i = v'_i$, $i = 1..n$. This has important implications for
the  constraint  solving  process.  In  particular,  it  means  that  the  least  solution  of  the  constraints 
can  be  incrementally  built  up  in  a  form  such  that  at  any  time,  the  structure  of  the  partial  solution 
constructed  thus  far  is  explicit,  and  questions  about  membership  and  emptiness  can  be  directly 
answered. 

However,  when  set  based  analysis  is  extended  to  arithmetic,  a  problem  arises:  the  equality 
theory of arithmetic is much richer than the structural equality of data-constructors. Two terms
with  very  different  structure  can  be  equal  (for  example,  3  and  (42  -  48)  +  9  represent  the  same 
value).  The  essential  problem  is:  how  can  we  explicitly  represent  and  incrementally  build  up 
solutions  of  constraints  involving  arithmetic? 

One approach to the arithmetic problem is to employ an abstract interpretation style
approximation of the arithmetic component of the language. That is, we could use set constraint
techniques to reason about the data-constructor component of a program, and abstract interpretation
techniques to reason about its arithmetic components. (With similar motivations, a hybrid
combination  of  set  constraints  and  abstract  interpretation  is  developed  for  logic  programs  in  [9].) 

Such  an  approach  has  two  disadvantages.  First,  we  would  like  to  avoid  increasing  the 
complexity  of  the  algorithm  beyond  the  complexity  of  the  core  set  constraint  algorithm  (for 
analysis of ML, the complexity is $O(n^3)$ where $n$ is the size of the input program; in practice,
it  can  be  engineered  to  be  nearly  linear).  This  means  that  we  have  to  place  severe  limits  on 
the  complexity  of  the  abstract  domain,  with  a  resulting  loss  of  information.  Second,  adding 
an  abstract  interpretation  mechanism  would  involve  adding  extra  machinery  to  the  algorithm, 
particularly  if  the  techniques  of  narrowing  and  widening  [3]  are  employed. 

In this paper we develop an alternative approach that effectively computes a representation
of  how  a  value  is  obtained  in  terms  of  the  basic  arithmetic  operations.  In  section  2  we  describe 
the  basic  extensions  of  set  based  analysis  needed  to  compute  descriptions  of  arithmetic  values. 
In  section  3  we  describe  an  important  extension  of  the  basic  approach  that  significantly  increases 
the  accuracy  of  the  analysis.  In  section  4  we  discuss  some  example  programs  and  provide  some 
preliminary  data  on  how  the  information  obtained  from  the  analysis  can  be  used  to  improve 
program performance. Finally, in section 5 we outline some future directions.

We  conclude  with  a  brief  discussion  of  related  work.  This  paper  is  heavily  dependent  on 
a  companion  paper  [6],  which  describes  the  set  based  analysis  of  a  call-by-value  functional 
language  with  data-constructors,  references,  arrays,  exceptions  and  callcc.  Hence  there  are  close 
connections  with  this  work  and  work  related  to  set  based  analysis  (for  example,  [1,  11,  12,  14, 
15]).  There  has  been  little  work  on  the  analysis  of  arithmetic  in  languages  with  higher-order 
functions, side-effects and continuations. However, there has been much work in the general
area of analysis of programs that contain arithmetic. Some of these works focus on obtaining
accurate information about complex relationships between program variables. For example,
[4] and [10] focus on obtaining information about linear relationships between variables. Another
example  is  array  data  dependency  analysis,  which  is  a  local  analysis  directed  towards  detecting 
loop  level  parallelism  in  numeric  programs.  At  the  other  extreme,  type  analysis  has  been  used  to 
inexpensively  obtain  very  simple  information  about  arithmetic  expressions.  Another  important 
classification  is  range  analysis,  based  on  abstract  interpretation  [3].  Range  analysis  typically 
ignores  dependencies  between  variable  values,  but  is  somewhat  more  accurate  than  type  analysis 
approaches.  The  motivation  for  this  kind  of  analysis  is  to  determine  when  arithmetic  tests  (such 
as  those  for  array  bounds  checking)  can  be  safely  ignored.  Our  work  is  most  closely  related  to 
range  analysis,  and  our  motivations  are  similar. 


2  Arithmetic  Expressions 


Consider  a  simple  call-by-value  functional  language  whose  terms  e  are  defined  by 

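$$
\begin{array}{rcl}
e & ::= & x \mid c(e_1, \ldots, e_n) \mid \lambda x.e \mid e_1\ e_2 \mid case(e_1,\ c(x_1, \ldots, x_n) \Rightarrow e_2,\ y \Rightarrow e_3) \mid fix\ x.e \\
  & \mid & i \mid e_1\ arithop\ e_2 \mid if\ x\ relop\ y\ then\ e_1\ else\ e_2
\end{array}
$$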

where $x, x_1, \ldots, x_n, y$ range over program variables, $c$ ranges over a given set of (varying arity,
“first-order”) constants, $i$ ranges over integers¹, arithop ranges over arithmetic operations such
as $+, -, \times$, etc., and relop ranges over comparison operations such as $=, <>, <, \le$, etc. It is
convenient to adopt the usual convention that each bound variable is distinct. The operator
fix serves to express recursion². This language is an extension of the language used in [6],
and  the  operational  semantics  we  shall  use  is  an  extension  of  the  one  given  there.  Specifically, 
environments  are  finite  mappings  from  program  variables  to  values.  Values  are  either  (i)  integers, 
(ii) of the form $c(v_1, \ldots, v_n)$ where the $v_i$ are values, or (iii) closures of the form $\langle E, \lambda x.e \rangle$ where
$E$ is an environment. We briefly outline evaluation of expressions. Evaluation of variables,
abstractions, applications and fix expressions proceeds in the usual way. Evaluation of a case
statement $case(e_1,\ c(x_1, \ldots, x_n) \Rightarrow e_2,\ y \Rightarrow e_3)$ proceeds by first evaluating $e_1$. If the result has
the form $c(v_1, \ldots, v_n)$, then the variables $x_i$ are bound to $v_i$ and $e_2$ is evaluated. Otherwise, $y$ is
bound to the result of evaluating $e_1$ and then $e_3$ is evaluated. Evaluation of $c(e_1, \ldots, e_n)$ proceeds
by evaluating each $e_i$, say to $v_i$, and then constructing the value $c(v_1, \ldots, v_n)$. Arithmetic
expressions $e_1\ arithop\ e_2$ are evaluated by first evaluating $e_1$ and then $e_2$, and then combining
the results using arithop. A conditional statement if $x$ relop $y$ then $e_1$ else $e_2$ is evaluated by
first evaluating $x$ and $y$, applying relop to the results, and then appropriately branching to either
$e_1$ or $e_2$. If the evaluation of an arithmetic operation causes an exception, then the computation
is aborted. We shall write $E \vdash e \rightarrow v$ if $e$ evaluates to $v$ in the context of environment $E$. If $E$
is the empty environment³ then we omit it, and just write $\vdash e \rightarrow v$.

¹ The restriction to integers is for convenience; the implementation, described later in the paper, deals with floating point numbers
as well as a variety of integer data types.

² In fix x.e, the expression e shall typically be an abstraction.
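
The evaluation rules outlined above can be sketched as a small interpreter. This is only an illustration under stated assumptions: terms are encoded as nested Python tuples (an encoding chosen here, not taken from the paper), the `fix` operator is omitted for brevity, and the names `evaluate`, `ARITH` and `REL` are hypothetical.

```python
import operator

ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}
REL = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def evaluate(e, env):
    """Evaluate term e (a nested tuple) in environment env (a dict)."""
    tag = e[0]
    if tag == "int":                       # ("int", i)
        return e[1]
    if tag == "var":                       # ("var", x)
        return env[e[1]]
    if tag == "con":                       # ("con", c, [e1, ..., en])
        return ("con", e[1], tuple(evaluate(a, env) for a in e[2]))
    if tag == "lam":                       # ("lam", x, body) becomes a closure
        return ("closure", env, e)
    if tag == "app":                       # ("app", e1, e2)
        _, closure_env, (_, param, body) = evaluate(e[1], env)
        arg = evaluate(e[2], env)
        return evaluate(body, {**closure_env, param: arg})
    if tag == "case":                      # ("case", e1, c, [x1..xn], e2, y, e3)
        _, e1, c, xs, e2, y, e3 = e
        v = evaluate(e1, env)
        matches = (isinstance(v, tuple) and v[0] == "con"
                   and v[1] == c and len(v[2]) == len(xs))
        if matches:
            return evaluate(e2, {**env, **dict(zip(xs, v[2]))})
        return evaluate(e3, {**env, y: v})
    if tag == "arith":                     # ("arith", op, e1, e2)
        left = evaluate(e[2], env)         # e1 is evaluated first, then e2
        right = evaluate(e[3], env)
        return ARITH[e[1]](left, right)
    if tag == "if":                        # ("if", x, relop, y, e1, e2)
        _, x, rel, y, e1, e2 = e
        return evaluate(e1 if REL[rel](env[x], env[y]) else e2, env)
    raise ValueError(f"unknown term: {tag}")


succ = ("lam", "a", ("arith", "+", ("var", "a"), ("int", 1)))
pair = ("con", "pair", [("int", 3), ("int", 9)])
larger = ("case", pair, "pair", ["p", "q"],
          ("if", "p", "<", "q", ("var", "q"), ("var", "p")),
          "z", ("int", 0))
```

With these definitions, `evaluate(("app", succ, ("int", 4)), {})` applies the successor function to 4, and `evaluate(larger, {})` destructures the pair and returns the larger component.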

In  [6],  we  gave  a  set  based  semantics  for  the  data-constructor  subset  of  the  above  language. 
This was achieved by replacing the use of environments in the (exact) operational semantics by
set environments. A set environment is like an environment except that it maps variables to
sets of values instead of individual values. By this means, we formalized the notion of ignoring
dependencies between variables, and defined the set based approximation $sba(e_0)$ of a program
$e_0$ (which is a conservative approximation of $\{v : \vdash e_0 \rightarrow v\}$⁴). We then extracted constraints
from a program and presented an algorithm to solve these constraints such that the output of the
algorithm is a representation of the (possibly infinite) set $sba(e_0)$. Strictly speaking, the output
of the set constraint algorithm was not $sba(e_0)$, but a set of descriptions from which $sba(e_0)$
could  be  reconstructed.  (The  algorithm  also  computes,  for  each  program  variable,  a  set  that 
conservatively  describes  all  values  that  the  variable  can  be  bound  to  during  program  execution.) 

We  now  extend  this  procedure  to  the  arithmetic  component  of  the  language.  The  key  idea  is 
to  generalize  the  notion  of  “description”  to  include  arithmetic  operations.  For  example  consider 
the  power  program. 


    let fun power(0, n) = 1
          | power(m, n) = n * power(m-1, n)
    in
      power(3, 4)
    end

Program  1 

When  our  analysis  is  applied  to  this  program,  the  following  regular  tree  grammar  is  output: 

    P => (4 × P)
    P => 1

This represents the set of expressions {1, 4 × 1, 4 × (4 × 1), 4 × (4 × (4 × 1)), ...}, and
indicates  that  the  program  returns  a  value  that  is  obtained  by  starting  with  1  and  repeatedly 
multiplying  by  4  some  arbitrary  number  of  times. 
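
To illustrate how such a grammar denotes a set of descriptions, the following sketch enumerates the descriptions derivable within a bounded nesting depth. The dictionary encoding of the grammar and the names `derive`, `expand` and `show` are assumptions made for this illustration; the analysis itself outputs the grammar, not an enumeration.

```python
def derive(grammar, start, depth):
    """Descriptions derivable from `start` with at most `depth` nested expansions."""
    if depth == 0:
        return set()
    results = set()
    for rhs in grammar[start]:
        results |= expand(grammar, rhs, depth)
    return results


def expand(grammar, rhs, depth):
    """Expand one right-hand side: an integer, a nonterminal name, or (op, l, r)."""
    if isinstance(rhs, int):
        return {rhs}
    if isinstance(rhs, str):
        return derive(grammar, rhs, depth - 1)
    op, left, right = rhs
    return {(op, l, r)
            for l in expand(grammar, left, depth)
            for r in expand(grammar, right, depth)}


def show(description):
    """Render a description as text."""
    if isinstance(description, int):
        return str(description)
    op, left, right = description
    return f"({show(left)} {op} {show(right)})"


# P => (4 * P) and P => 1, as output for Program 1.
GRAMMAR = {"P": [("*", 4, "P"), 1]}
descriptions = derive(GRAMMAR, "P", 3)
```

With depth 3 the sketch yields the first three members of the set: 1, (4 * 1) and (4 * (4 * 1)). Increasing the depth adds one further multiplication by 4 each time.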
³ That is, its domain is the empty set of variables.
⁴ This set is either empty or a singleton set.


We  begin  by  describing  the  set  constraints  we  shall  employ.  The  calculus  described  here 
is an extension of the calculus described in [6], which in turn is based on the earlier works
[8, 12, 15]. The form and meaning of the constraints are defined in the context of some given
closed term $e_0$. We assume a fixed infinite class of set variables; set variables shall be denoted
$\mathcal{W}, \mathcal{X}, \mathcal{Y}, \mathcal{Z}$. We distinguish two special disjoint subclasses of set variables. First, for each
program variable $x$ in $e_0$, there is a distinct set variable $\mathcal{X}_x$ which shall be used to capture all
of the values for the program variable $x$. Second, for each abstraction $\lambda x.e$ appearing in $e_0$,
there is a distinct set variable $ran(\lambda x.e)$, the “range” of $\lambda x.e$, which shall be used to capture
all of the values returned by applications of $\lambda x.e$ during execution. Now, in the context of
the given term $e_0$, we define that a set expression (se) is either a set variable, an abstraction
$\lambda x.e$ that appears in $e_0$, an integer, or of one of the forms $se_1\ arithop\ se_2$, $c(se_1, \ldots, se_n)$,
$apply(se_1, se_2)$, $case(se_1,\ c(\mathcal{X}_1, \ldots, \mathcal{X}_n) \Rightarrow se_2,\ \mathcal{Y} \Rightarrow se_3)$ or $ifnonempty(se_1, se_2)$. The first
form is used to model arithmetic expressions, and is the main change from [6]. The second form
models expressions $c(e_1, \ldots, e_n)$, the third models application, the fourth is for case statements,
and the last is used to reason about emptiness. A set constraint is an expression of the form
$\mathcal{X} \supseteq se$, and a conjunction $\mathcal{C}$ of set constraints is a finite collection of set constraints.

We  now  outline  the  meaning  of  set  constraints.  Define  that  a  set  constraint  value  (sc-value) 
is either an abstraction $\lambda x.e$ that appears in $e_0$, an integer $i$, or of the form $c(u_1, \ldots, u_n)$ or
$u_1\ arithop\ u_2$ where each $u_i$ is an sc-value. In essence, sc-values are descriptions of values. These
descriptions differ from values in two respects: (i) the descriptions contain arithmetic operations,
and (ii) the descriptions omit the environment component of closures⁵. Set expressions are
interpreted as sets of these descriptions of values. Specifically, an interpretation is a mapping
from each set variable into a set of sc-values. Such an interpretation is extended to map set
expressions to sets of sc-values. The rules for this interpretation are essentially identical to those
in [6]. Two new rules are needed for the new kinds of set expressions:

• $\mathcal{I}(i) = \{i\}$, where $i$ is an integer, and

• $\mathcal{I}(se_1\ arithop\ se_2) = \{u_1\ arithop\ u_2 : u_1 \in \mathcal{I}(se_1) \wedge u_2 \in \mathcal{I}(se_2)\}$.

The  complete  definition  is  given  in  Appendix  I.  It  is  easy  to  verify  a  model  intersection  property 
for  the  set  constraints  used  in  this  paper,  and  it  follows  that  a  conjunction  C  of  constraints 
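possesses a least model, denoted $lm(\mathcal{C})$, where models are ordered as follows: $\mathcal{I}_1 \supseteq \mathcal{I}_2$
if $\mathcal{I}_1(\mathcal{X}) \supseteq \mathcal{I}_2(\mathcal{X})$ for all set variables $\mathcal{X}$. For example, the least model of the constraint
$\mathcal{X} \supseteq (4 \times \mathcal{X}) \cup 1$ maps the set variable $\mathcal{X}$ into the set of values
$\{1,\ 4 \times 1,\ 4 \times (4 \times 1),\ 4 \times (4 \times (4 \times 1)),\ \ldots\}$ (which is the same as the language generated by the grammar given earlier in
this section).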

We  now  outline  how  set  constraints  are  constructed  for  a  program.  Since  this  construction 
is an extension of that given in [6], we shall only discuss the additional rules required (these
correspond to the arithmetic expressions of the language). The following is a slightly simplified
version of the new rules (the complete set of rules appears in Appendix II). The judgements⁶
of this system have the form $e \rhd (\mathcal{X}, \mathcal{C})$ where $e$ is a term, $\mathcal{X}$ is a set variable and $\mathcal{C}$ is a
collection of set constraints.

⁵ Importantly, these can be reconstructed from the output of the set constraint algorithm.

⁶ Strictly speaking, the judgements used in Appendix II have a slightly more complex form: $\mathcal{Z} \vdash e \rhd (\mathcal{X}, \mathcal{C})$. The additional
set variable $\mathcal{Z}$ is used to reason about non-emptiness. In effect, it defers the actions of a function until it is determined that the
function may be called.


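$$
i \rhd (i, \{\}) \qquad \text{(INTEGER)}
$$

$$
\frac{e_1 \rhd (\mathcal{X}_1, \mathcal{C}_1) \qquad e_2 \rhd (\mathcal{X}_2, \mathcal{C}_2)}
     {e_1\ arithop\ e_2 \rhd (\mathcal{Y},\ \{\mathcal{Y} \supseteq \mathcal{X}_1\ arithop\ \mathcal{X}_2\} \cup \mathcal{C}_1 \cup \mathcal{C}_2)}
\qquad \text{(ARITHOP)}
$$

$$
\frac{e_1 \rhd (\mathcal{X}_1, \mathcal{C}_1) \qquad e_2 \rhd (\mathcal{X}_2, \mathcal{C}_2)}
     {if\ x_1\ relop\ x_2\ then\ e_1\ else\ e_2 \rhd (\mathcal{Y},\ \{\mathcal{Y} \supseteq \mathcal{X}_1 \cup \mathcal{X}_2\} \cup \mathcal{C}_1 \cup \mathcal{C}_2)}
\qquad \text{(IF)}
$$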


Note  that  the  rule  for  conditional  statements  merely  joins  together  the  results  from  the  two 
branches  of  the  conditional  statement.  In  other  words,  information  about  the  “test”  part  of  the 
statement  is  ignored.  Clearly  this  loss  of  information  is  unacceptable;  we  address  this  issue  in 
the next section. Let $SC(e_0)$ denote the pair $(\mathcal{X}, \mathcal{C})$ constructed for a term $e_0$ according to the
constraint construction procedure ($\mathcal{C}$ is the collection of constraints constructed for $e_0$, and $\mathcal{X}$
is effectively a pointer into the constraints that indicates which set variable captures the set of
values corresponding to $e_0$).
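
The simplified rules for the arithmetic fragment can be sketched as a recursive constraint generator. Several details are assumptions made for this illustration rather than the construction of Appendix II: terms are nested tuples, a program variable x is represented by the string `X_x`, fresh set variables are named `Y1`, `Y2`, ..., and the constraint with a union on its right-hand side is emitted as two separate constraints, one per branch.

```python
from itertools import count


def make_generator():
    """Return a constraint generator with its own supply of fresh set variables."""
    fresh = count(1)

    def gen(e):
        """Return (result, constraints); each constraint is (set_variable, set_expression)."""
        tag = e[0]
        if tag == "int":                   # (INTEGER): no constraints needed
            return e[1], []
        if tag == "var":                   # a program variable maps to its set variable
            return f"X_{e[1]}", []
        if tag == "arith":                 # (ARITHOP)
            _, op, e1, e2 = e
            x1, c1 = gen(e1)
            x2, c2 = gen(e2)
            y = f"Y{next(fresh)}"
            return y, [(y, (op, x1, x2))] + c1 + c2
        if tag == "if":                    # (IF): the test itself is ignored
            _, _x, _rel, _y, e1, e2 = e
            x1, c1 = gen(e1)
            x2, c2 = gen(e2)
            y = f"Y{next(fresh)}"
            return y, [(y, x1), (y, x2)] + c1 + c2
        raise ValueError(f"unsupported term: {tag}")

    return gen


# if a < b then a + 1 else 2
term = ("if", "a", "<", "b",
        ("arith", "+", ("var", "a"), ("int", 1)),
        ("int", 2))
result, constraints = make_generator()(term)
```

For this term the generator returns the set variable `Y2` together with the constraints Y2 ⊇ Y1, Y2 ⊇ 2 and Y1 ⊇ X_a + 1. Note that nothing about the test a < b appears in the output, which is the loss of information discussed in the text.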

Meanwhile,  we  describe  the  changes  to  the  set  constraint  algorithm  of  [6]  that  are  needed 
to  deal  with  the  new  constraints.  In  effect  the  algorithm  is  extended  so  that  set  expressions 
of the form $se_1\ arithop\ se_2$ are treated in an identical manner to set expressions of the form
$c(se_1, \ldots, se_n)$. That is, arithmetic operations are treated like data-constructors. The only
difference is that we have no de-construction operation for arithmetic operations: we can build
them up, but we can't break them down. Formally, this is achieved by adapting the definition
of atomic set expressions to include expressions of the form $ae_1\ arithop\ ae_2$ where $ae_1$ and
$ae_2$ are atomic set expressions. With this change of definition, the set constraint simplification
algorithm of [6] can be directly applied. When input with a collection of constraints $\mathcal{C}$, this
$O(n^3)$ algorithm computes an explicit representation of the least model of the constraints $\mathcal{C}$. This
representation takes the form of a regular tree grammar.


We  now  formalize  the  connection  between  the  descriptions  of  values  computed  by  the  set 
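constraint algorithm and the values of the operational semantics. Define a meaning function,
which maps an arbitrary sc-value into an sc-value that does not contain arithmetic operations, as
follows:

$$
\llbracket u \rrbracket =
\begin{cases}
\lambda x.e & \text{if } u \text{ is } \lambda x.e \\
i & \text{if } u \text{ is } i \\
c(\llbracket u_1 \rrbracket, \ldots, \llbracket u_n \rrbracket) & \text{if } u \text{ is } c(u_1, \ldots, u_n) \\
\llbracket u_1 \rrbracket\ arithop\ \llbracket u_2 \rrbracket & \text{if } u \text{ is } u_1\ arithop\ u_2 \text{, and } \llbracket u_1 \rrbracket \text{ and } \llbracket u_2 \rrbracket \text{ are both integers}
\end{cases}
$$

In the last line of the definition, $\llbracket u_1 \rrbracket\ arithop\ \llbracket u_2 \rrbracket$ denotes the result of applying the arithmetic
operation to the integers $\llbracket u_1 \rrbracket$ and $\llbracket u_2 \rrbracket$. For example, $\llbracket (3 + 4) - 1 \rrbracket$ is 6. Finally, given a set $S$
of sc-values, define that $\llbracket S \rrbracket$ is $\{\llbracket u \rrbracket : u \in S \wedge \llbracket u \rrbracket \text{ is defined}\}$.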
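
The meaning function can be sketched as a partial function that returns `None` where the meaning is undefined. The tuple encoding of sc-values and the names `meaning` and `meaning_of_set` are assumptions made for this illustration. Two further choices are assumptions as well: an operation that raises an arithmetic exception (here, division by zero) is treated as undefined, and a constructor term is treated as undefined when any of its arguments is.

```python
import operator

ARITH = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "div": operator.floordiv}


def meaning(u):
    """Return the meaning of sc-value u, or None if it is undefined."""
    if isinstance(u, int):
        return u
    tag = u[0]
    if tag == "lam":                       # ("lam", x, body): abstractions are kept
        return u
    if tag == "con":                       # ("con", c, (u1, ..., un))
        args = [meaning(a) for a in u[2]]
        if any(a is None for a in args):   # assumption: undefined argument propagates
            return None
        return ("con", u[1], tuple(args))
    if tag == "arith":                     # ("arith", op, u1, u2)
        _, op, u1, u2 = u
        m1, m2 = meaning(u1), meaning(u2)
        if isinstance(m1, int) and isinstance(m2, int):
            try:
                return ARITH[op](m1, m2)
            except ZeroDivisionError:      # assumption: an exception means undefined
                return None
        return None
    return None


def meaning_of_set(descriptions):
    """Meaning of a set of sc-values: collect the defined meanings."""
    return {m for m in map(meaning, descriptions) if m is not None}


example = ("arith", "-", ("arith", "+", 3, 4), 1)   # (3 + 4) - 1
```

Applying `meaning` to `example` gives 6, as in the text. Descriptions such as an abstraction added to an integer have no meaning and are dropped by `meaning_of_set`.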


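Next, we extend the operator $\|v\|$ on values $v$ (defined in [6]), whose purpose is to ignore
the environment part of closures:

$$
\|v\| =
\begin{cases}
\lambda x.e & \text{if } v \text{ is } \langle E, \lambda x.e \rangle \\
i & \text{if } v \text{ is } i \\
c(\|v_1\|, \ldots, \|v_n\|) & \text{if } v \text{ is } c(v_1, \ldots, v_n)
\end{cases}
$$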


Finally,  we  can  state  the  correspondence  between  the  set  constraints  constructed  from  a  program 
and  the  operational  behaviour  of  the  program: 

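Theorem 1 Let $e_0$ be a closed term, let $SC(e_0)$ be $(\mathcal{X}, \mathcal{C})$ and let $\mathcal{I}_{lm} = lm(\mathcal{C})$.
Then $\{\|v\| : \vdash e_0 \rightarrow v\} \subseteq \llbracket \mathcal{I}_{lm}(\mathcal{X}) \rrbracket$.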

The  correctness  of  the  analysis  procedure  described  in  this  section  follows  from  the  above 
theorem and the correctness of the set constraint simplification algorithm presented in [6]. Note
that the complexity of the whole analysis process (constructing set constraints from a program and
then solving them) is $O(n^3)$ where $n$ is the size of the input program. This is because the
translation of a program into its set constraints is $O(n)$ in the size of the program and the set
constraint algorithm is an $O(n^3)$ algorithm.


3  Conditional  Statements 


In  the  previous  section,  we  outlined  a  basic  analysis  for  a  call-by-value  functional  language 
involving  arithmetic.  A  limitation  of  this  analysis  was  that  it  ignored  information  in  the  tests  of 
conditional statements. To see the specific ways in which the treatment of conditional statements
loses information, consider the statement if $x_1$ relop $x_2$ then $e_1$ else $e_2$. Now, if the test
$x_1$ relop $x_2$ always succeeds or always fails, then the analysis will infer inferior information
because it unnecessarily joins together the values obtained from both $e_1$ and $e_2$. In contrast,
the analysis of a case statement $case(e_1,\ c(x_1, \ldots, x_n) \Rightarrow e_2,\ y \Rightarrow e_3)$ ignores the values from
$e_2$ and $e_3$ until they are activated (i.e. until it is determined that $e_1$ may contain values that
necessitate that the particular arm be executed). This is modeled by the case set expression.
Unfortunately,  it  is  difficult  to  treat  conditional  statements  in  an  analogous  manner.  This  is 
because  the  set  constraints  reason  about  descriptions  of  arithmetic  computations,  and  from  these 
descriptions  there  is  no  simple  way  to  determine  whether  a  branch  will  be  taken.  (The  key  point 
is  that  this  test  must  potentially  be  done  at  each  step  of  the  algorithm;  it  is  therefore  imperative 
for  efficiency  that  it  be  a  local  test.  Note  that  for  case  statements,  the  corresponding  test  -  also 
potentially  done  at  each  step  -  is  a  trivial  constant  time  test.)  This  is  a  direct  reflection  of  the 
difference  between  the  simple  structural  equality  of  data-constructors,  and  the  more  complex 
equality  of  arithmetic. 

However,  there  is  a  much  more  significant  loss  of  information  than  the  issue  of  “dead” 
branches in conditional statements. Consider the statement if $x > y$ then $e_1$ else $e_2$, and for
simplicity, suppose that the only free variables in $e_1$ and $e_2$ are $x$ and $y$. Now, inside expression
$e_1$ it will always be the case that $x > y$ and inside $e_2$ it will be the case that $x \le y$. In other words,
the conditional statement serves to restrict the possible bindings to variables in the scope of its
body. From an analysis perspective, it is therefore useful to think of conditional statements as a
form of variable binding mechanism. More concretely, suppose that the conditional statement
is to be executed in environment $\rho$ and let $e'_1$ and $e'_2$ be the result of renaming $x$ and $y$ into new
variables $x'$ and $y'$ in $e_1$ and $e_2$. Then, at an informal level, the evaluation of $e'_1$ and $e'_2$ can be
viewed as proceeding in the following environments:

$$
\begin{array}{ll}
\text{environment for } e'_1 & \text{environment for } e'_2 \\
x' \mapsto \rho([> y](x)) & x' \mapsto \rho([\le y](x)) \\
y' \mapsto \rho([< x](y)) & y' \mapsto \rho([\ge x](y))
\end{array}
$$

where $[> y](x)$ denotes the value of $x$ if $x > y$ and is undefined otherwise, $[< x](y)$ denotes
the value of $y$ if $y < x$ and is undefined otherwise, and similarly for $[\le y](x)$ and $[\ge x](y)$.
More generally, define an expression $[relop\ y](x)$ which, in the context of some environment $\rho$,
denotes $\rho(x)$ if $\rho(x)\ relop\ \rho(y)$ and is undefined otherwise.

Such  a  view  of  conditional  statements  could  be  formalized  to  give  an  alternative  operational 
semantics  that  emphasizes  the  latent  binding  effects  of  conditional  statements.  Rather  than 
spell  out  the  details  of  this,  we  shall  instead  focus  on  how  these  intuitions  can  be  employed  in 
analysis. In effect, we shall treat expressions such as $[> y](x)$ in much the same way as we treat
expressions such as 1 + 3. As before, we shall construct set constraints from a program that shall
reason about descriptions of values (rather than the actual values), but now these descriptions
shall involve expressions of the form $[relop\ u_1](u_2)$ in addition to $u_1\ arithop\ u_2$.

To this end, we first introduce a new kind of set expression which has the form $[relop\ se_1](se_2)$.
We shall also extend the definition of sc-value to include expressions of the form $[relop\ u_1](u_2)$.
The new form of set expression shall be interpreted as follows:

$$\mathcal{I}([relop\ se_1](se_2)) = \{[relop\ u_1](u_2) : u_1 \in \mathcal{I}(se_1) \wedge u_2 \in \mathcal{I}(se_2)\}.$$

Accordingly, the definition of the function $\llbracket \cdot \rrbracket$, which maps an arbitrary sc-value into its meaning,
is extended as follows: $\llbracket [relop\ u_1](u_2) \rrbracket = \llbracket u_2 \rrbracket$ if $\llbracket u_2 \rrbracket\ relop\ \llbracket u_1 \rrbracket$ and is undefined otherwise.

Finally, the construction of constraints is modified so that when an expression of the form
if $x$ relop $y$ then $e_1$ else $e_2$ is encountered, occurrences of $x$ and $y$ in $e_1$ and $e_2$ are treated
specially:

• when $x$ is encountered in $e_1$, the set expression $[relop\ \mathcal{X}_y](\mathcal{X}_x)$ is generated instead of $\mathcal{X}_x$;

• when $y$ is encountered in $e_1$, the set expression $[op(relop)\ \mathcal{X}_x](\mathcal{X}_y)$ is generated instead of
$\mathcal{X}_y$;

• when $x$ is encountered in $e_2$, the set expression $[neg(relop)\ \mathcal{X}_y](\mathcal{X}_x)$ is generated instead
of $\mathcal{X}_x$; and

• when $y$ is encountered in $e_2$, the set expression $[op(neg(relop))\ \mathcal{X}_x](\mathcal{X}_y)$ is generated
instead of $\mathcal{X}_y$,

where op maps a relational operator into its opposite (for example $op(<)$ is $>$), and neg maps
a relational operator into its negation (for example $neg(<)$ is $\ge$). Strictly speaking, we shall
cascade this process, so that at the occurrence of $x$ in the expression $x + 1$ of the conditional
statement

    if x > y then if x < z then x + 1 else 1 else 2

the set expression $[< \mathcal{X}_z]([> \mathcal{X}_y](\mathcal{X}_x))$ is generated.
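
The treatment of variable occurrences inside conditionals, including the cascading of nested tests, can be sketched as follows. The tables `OP` and `NEG`, the functions `refine` and `occurrence`, and the string rendering of set expressions are all hypothetical names and encodings chosen for this illustration.

```python
# op: the same relation read with its operands swapped; neg: logical negation.
OP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "=", "<>": "<>"}
NEG = {"<": ">=", ">": "<=", "<=": ">", ">=": "<", "=": "<>", "<>": "="}


def refine(filters, x, rel, y, branch):
    """Filters in force inside one branch ("then" or "else") of `if x rel y`."""
    if branch == "else":
        rel = NEG[rel]
    updated = dict(filters)
    updated[x] = updated.get(x, ()) + ((rel, y),)
    updated[y] = updated.get(y, ()) + ((OP[rel], x),)
    return updated


def occurrence(filters, x):
    """Set expression generated for an occurrence of x; inner tests wrap outer ones."""
    expression = f"X_{x}"
    for rel, other in filters.get(x, ()):
        expression = f"[{rel} X_{other}]({expression})"
    return expression


# if x > y then if x < z then x + 1 else 1 else 2
outer = refine({}, "x", ">", "y", "then")
inner = refine(outer, "x", "<", "z", "then")
generated = occurrence(inner, "x")
```

For the nested conditional from the text, `generated` is the string `[< X_z]([> X_y](X_x))`. In an else branch the relation is negated first, so for `if x < y` the variable x is filtered by `>=` and y by `<=`.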

We  conclude  this  section  with  an  example  illustrating  the  kinds  of  output  generated  by  this 
modification  to  the  analysis.  Consider  the  following  ML  program  for  adding  up  the  elements 
in  an  array.  The  functions  update,  sub  and  length  are  the  update,  subscript  and  array  length 
operations;  the  expression  array(10,  3)  creates  an  array  of  size  10  with  each  element  initialized 
to  3  (the  valid  subscripts  of  the  array  are  0..9). 

    let fun cum (arr : int array) =
          let fun f i = if (i = 0) then arr sub 0
                        else (arr sub i) + f(i - 1)
          in
            f ((length arr) - 1)
          end
    in
      cum (array(10, 3))
    end

Program  2 

Of  particular  interest  is  the  set  of  values  for  i.  If  we  can  determine  that  i  is  always  in  the 
range  0..9  then  the  array  bounds  check  can  be  eliminated  during  the  subscript  operation.  When 
applied to Program 2, the analysis described in this section outputs the following tree grammar
corresponding to the program variable i, where I is the set variable corresponding to i:

    I => 10 - 1
    I => ([<> 0](I)) - 1

That is, the set of values obtained for i is the set consisting of the integer 9 as well as any integer
obtained by subtracting 1 from any non-zero integer described by I. It is easy to verify⁷ from
this description that the possible values for i must lie in the range 0..9, and so the array bounds
check is not required.

⁷ In general, some postprocessing is needed to obtain the required properties from the descriptions output by the analyzer. We
address this issue further in Section 5.
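
The verification step for this particular grammar can be sketched by computing the integer meanings directly. This is not the postprocessing used by the analyzer, only a worklist computation written for this one grammar; the function name `values_of_I` is hypothetical. The computation terminates here because the filter `<> 0` stops the descent at 0; for an arbitrary grammar such a direct enumeration need not terminate.

```python
def values_of_I():
    """Integer meanings of I, where I => 10 - 1 and I => ([<> 0](I)) - 1."""
    values = {10 - 1}                      # first production
    worklist = [10 - 1]
    while worklist:
        v = worklist.pop()
        if v != 0:                         # the filter [<> 0] admits only non-zero values
            new = v - 1                    # second production
            if new not in values:
                values.add(new)
                worklist.append(new)
    return values


indices = values_of_I()
bounds_check_needed = not all(0 <= i <= 9 for i in indices)
```

The computed set is exactly the integers 0 through 9, so `bounds_check_needed` is false for an array whose valid subscripts are 0..9.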
For the purposes of comparison, we now outline how the abstract interpretation approach
based on the domain of intervals with widening and narrowing would perform on this example.
The abstract values of the interval domain take the form $[x, y]$, $[x, \infty]$, $[-\infty, y]$ or $[-\infty, \infty]$,
where $x$ and $y$ are integers. An interval $[x, y]$ indicates the set of all integers $i$ such that $x \le i \le y$.
(See [3] for further details.) We now trace out the behaviour of the analysis and focus on the
values obtained for the program variable i. The first value for i is 9, and this is represented by the
interval [9, 9]. During the next iteration, a new value 8 is obtained for i, and this is represented by
the interval [8, 8]. When these two intervals are combined using the widening operator typically
employed for this domain⁸ the result obtained is $[-\infty, 9]$. This completes the widening phase of
the analysis (at least as far as i is concerned). Next, the narrowing phase of the analysis begins.
However, for this program, it is not possible to obtain a tighter bound on i using narrowing, and
so the final result is $[-\infty, 9]$, which does not imply that the array bounds check can be removed.
Observe that if the test in the conditional statement were changed to $i \le 0$ instead of $i = 0$,
then the abstract interpretation approach would obtain the description [0, 9] for i.
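
The interval trace above can be reproduced with a minimal sketch of the standard interval widening and narrowing operators. The abstract equation used here, I = [9, 9] joined with (filter(I) - 1), is a simplification written for this one program, and all function names are hypothetical. The sketch assumes the filtered interval is never empty, which holds in the two runs shown.

```python
import math


def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Standard interval widening: an unstable bound jumps to infinity."""
    lo = old[0] if old[0] <= new[0] else -math.inf
    hi = old[1] if old[1] >= new[1] else math.inf
    return (lo, hi)


def narrow(old, new):
    """Standard interval narrowing: only infinite bounds may be refined."""
    lo = new[0] if old[0] == -math.inf else old[0]
    hi = new[1] if old[1] == math.inf else old[1]
    return (lo, hi)


def not_equal_zero(interval):
    """Else branch of `i = 0`: an interval can only drop 0 at an endpoint."""
    lo, hi = interval
    return (1 if lo == 0 else lo, -1 if hi == 0 else hi)


def greater_than_zero(interval):
    """Else branch of `i <= 0`: the lower bound becomes at least 1."""
    lo, hi = interval
    return (max(lo, 1), hi)


def step(interval, else_filter):
    lo, hi = else_filter(interval)
    return join((9, 9), (lo - 1, hi - 1))


def analyse(else_filter):
    current = (9, 9)
    while True:                            # widening phase
        following = widen(current, step(current, else_filter))
        if following == current:
            break
        current = following
    while True:                            # narrowing phase
        following = narrow(current, step(current, else_filter))
        if following == current:
            break
        current = following
    return current
```

Running `analyse(not_equal_zero)` leaves the lower bound at negative infinity, because removing 0 from the middle of an interval changes nothing, whereas `analyse(greater_than_zero)` lets narrowing recover the lower bound 0.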


4  Examples 


We  now  consider  two  more  substantial  examples  to  illustrate  the  kinds  of  information  that  our 
analysis  can  obtain.  The  first  example  is  a  small  package  (about  100  lines)  that  implements 
two-dimensional  arrays  using  the  one-dimensional  arrays  of  SML  of  New  Jersey.  The  package 
provides subscript and update operations, as well as a matrix multiplication operation. This
package was analyzed in the context of a benchmark that multiplies two 100 x 100 matrices.
The time for analysis was approximately 0.06s⁹. Of particular interest are the manual bounds
check operations for each dimension of the two-dimensional arrays. A typical example of the
results obtained by the analysis for an index involved in bounds checking is given by the set
variable X in the following regular tree grammar:

X => [< 100](Y)

Y => ([< 100](Y)) + 1
Y => 0

Focus first on the set variable Y. The values described by Y consist of the integer 0 and any
value obtained by adding 1 to some value in Y that is strictly less than 100. It is easy to see that
the values described by Y are exactly the interval 0..100. It follows that the set described by
X is the interval 0..99, and therefore the bounds checks can be eliminated. In fact the analysis
determines  that  all  of  the  two-dimensional  bounds  checks  can  be  eliminated  for  this  program. 
This  resulted  in  an  approximately  40%  speedup  in  the  benchmark. 
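
The reasoning about this grammar can be checked by brute force. The sketch below reads the productions as "Y contains 0, and v + 1 for every v in Y with v < 100" and "X contains every v in Y with v < 100", and iterates to the least fixed point. It is only an illustration of what the grammar denotes; the analysis itself does not enumerate values, and the function name is hypothetical.

```python
def solve(limit=100):
    """Least solution of  Y => 0 | ([< limit](Y)) + 1  and  X => [< limit](Y)."""
    Y = set()
    while True:
        new_Y = {0} | {v + 1 for v in Y if v < limit}
        if new_Y == Y:          # fixed point reached
            break
        Y = new_Y
    X = {v for v in Y if v < limit}
    return X, Y

X, Y = solve()
print(min(Y), max(Y))   # 0 100
print(min(X), max(X))   # 0 99
```

Every value in X lies in 0..99, so a subscript described by X is always valid for an array dimension of size 100.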

⁸ In general, widening is needed to obtain a terminating abstract interpretation analysis using the interval domain.

⁹ All execution times reported in this section are in seconds on a PMAX 5000/200 with 64M and running Mach. The analysis
is implemented in Standard ML of New Jersey [2], Version 0.93.

The second example is the core part of the set based analysis implementation (approximately
2700 lines). It relies less heavily on array operations, although it does contain substantial
symbolic and imperative aspects. It also makes significant use of higher-order functions: in
particular, functions are stored in lists, data-structures and arrays and then recovered and called
when necessary. An important part of the design of the system is the use of integers to
provide a compact and efficient representation of certain objects called terms. The class of terms
is partitioned into identifiers and compound-terms. Negative integers are used for identifiers
and positive integers are used for compound-terms. Both identifiers and compound-terms are
allocated using global references that contain the respective next integers to be allocated. Finally,
both kinds of terms are used to index into arrays. The following is suggestive of the allocation
mechanism for compound-terms.

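val last_compound_term = ....
(* next unused integer for a compound-term *)
val next_compound_term = ref 1
fun new_compound_term () =
    let val i = !next_compound_term in
        (* advance the counter and return the integer just allocated *)
        next_compound_term := i + 1;
        i
    end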

where  ....  indicates  the  computation  of  the  maximum  size  of  integers  used  for  compound-terms 
(it  is  computed  from  some  parameter  describing  the  problem  size).  Any  arrays  indexed  by 
compound-terms are created with size last_compound_term + 1. There is a similar mechanism
for the allocation of identifiers, except that identifiers are allocated in the opposite direction,
starting from -1. These term objects appear in almost all parts of the program, are stored in
arrays  and  data-structures  and  also  appear  in  the  closures  of  higher-order  functions. 

The  analysis  of  this  second  example  was  performed  in  about  7.6s.  The  results  of  the  analysis 
for variables that range over term objects are typically of the following form (the set variable X
describes the possible values of the program variable of interest).

X => [≠ -((20000/3) + 1)](Y)
X => [≠ ((20000/3) + 1)](Z)
X => 0

Y => -1
Y => [≠ -((20000/3) + 1)](Y) - 1
Y => 0

Z => 1
Z => [≠ ((20000/3) + 1)](Z) + 1
Z => 0

Note that the expression (20000/3) + 1 corresponds to the initial computation to determine
the bounds on the allocation of identifiers and compound-terms. The above description clearly
reflects the allocation of compound-terms (starting from 1 and incrementing) and the allocation
of identifiers (starting from -1 and decrementing). The possible value of 0 is used to represent
an  uninitialized  term  and  is  used  in  a  part  of  the  program  at  which  a  variable  must  be  defined 
before  it  can  be  given  its  proper  value. 

Most importantly, the descriptions for the variables that are used for subscripting into arrays
have one of the following two forms: [> 0](X) or -[< 0](X). In either case, the description
establishes that the array operation can be done without bounds checking. Preliminary results
indicate that when these array operations are done without array bounds checking, overall
performance of the analyzer improves by about 6% to 13%, depending on the program analyzed.
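
As a rough check of this claim, the sketch below enumerates the values denoted by the term grammar and tests the two subscript forms against an array of size last_compound_term + 1. Several points are assumptions made for illustration: "/" is taken to be integer division, last_compound_term is taken to equal (20000/3) + 1, each of Y and Z is taken to contain 0, and -[< 0](X) is read as the negation of the negative members of X. The helper name `closure` is hypothetical.

```python
BOUND = 20000 // 3 + 1  # assumption: "/" is integer division, giving 6667

def closure(seeds, successor):
    """Least set containing `seeds` and closed under `successor`.

    `successor` returns None when the filter rejects a value.
    """
    seen, work = set(seeds), list(seeds)
    while work:
        nxt = successor(work.pop())
        if nxt is not None and nxt not in seen:
            seen.add(nxt)
            work.append(nxt)
    return seen

# Y => -1 | 0 | [≠ -BOUND](Y) - 1   (identifiers, counting down)
Y = closure({-1, 0}, lambda v: v - 1 if v != -BOUND else None)
# Z => 1 | 0 | [≠ BOUND](Z) + 1     (compound-terms, counting up)
Z = closure({1, 0}, lambda v: v + 1 if v != BOUND else None)
# X => [≠ -BOUND](Y) | [≠ BOUND](Z) | 0
X = {v for v in Y if v != -BOUND} | {v for v in Z if v != BOUND} | {0}

compound_indices = {v for v in X if v > 0}       # [> 0](X)
identifier_indices = {-v for v in X if v < 0}    # -[< 0](X)
array_size = BOUND + 1  # assumption: last_compound_term equals BOUND

in_bounds = all(0 <= i < array_size
                for i in compound_indices | identifier_indices)
print(in_bounds)
```

Under these assumptions every subscript of either form falls inside the array, which is the property needed to drop the bounds check.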


5  Conclusion 


We  have  presented  an  extension  to  set  based  analysis  for  the  treatment  of  arithmetic  expressions 
in  an  ML-like  language.  The  analysis  yields  revealing  descriptions  of  how  arithmetic  values  are 
obtained  during  the  execution  of  a  program.  We  have  investigated  the  use  of  this  information 
to  remove  array  bounds  checks  for  two  different  styles  of  programs.  Preliminary  data  suggests 
that  the  approach  presented  here  will  scale  up  to  large  programs. 

We  observe  that  the  information  computed  by  the  analysis  described  by  this  paper  is  not 
explicit  in  the  sense  that  properties  cannot  be  directly  read  from  it.  Rather,  some  post-processing 
is  required.  In  practice,  the  information  that  is  relevant  to  a  particular  program  variable  is 
usually quite small and fairly easy to reason about, even for moderately sized programs. In
general, however, it may involve complex recurrence relations¹⁰. One area of future work is the
development  of  tools  to  reason  about  the  output  of  the  analyzer. 

Although  the  use  of  a  somewhat  implicit  representation  of  values  is  a  disadvantage  in  the 
sense  that  postprocessing  is  required,  it  is  also  an  advantage  because  it  means  that  we  can 
compute a much richer variety of properties than is possible in other approaches (for example, in
abstract  interpretation,  we  fix  in  advance  an  explicit  representation  used  in  the  analysis,  but  in  the 
process  we  restrict  the  properties  that  can  be  inferred).  In  effect,  our  approach  defers  the  choice 
of  “interesting  properties”  until  the  post-processing  stage.  This  leads  to  an  important  level  of 
modularity,  since  when  we  wish  to  analyze  for  a  new  property,  we  typically  only  need  to  construct 
a  new  postprocessing  stage.  For  example,  suppose  that  for  alignment  purposes,  we  wished  to 
determine  whether  a  byte  array  was  always  created  with  a  size  that  was  a  multiple  of  4.  Such 
information  could  be  determined  from  the  output  of  our  analysis  by  adapting  the  postprocessing 
stage  for  the  property  of  interest.  In  comparison,  consider  an  abstract  interpretation  analysis 
based  on  intervals  (such  as  used  in  the  example  at  the  end  of  Section  3).  In  order  to  modify  the 
abstract  interpretation  analysis  for  this  new  property,  we  would  need  to  completely  redesign  the 
analysis  to  appropriately  enrich  the  abstract  domain. 
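
To make the modularity argument concrete, here is a minimal sketch of what such a post-processing stage might look like for the multiple-of-4 property. It assumes the analysis output for the size variable has already been reduced to productions of two simple shapes (a constant, or another variable plus a constant); the grammar, the tuple encoding and the function name are all hypothetical and are not part of the analyzer described in this paper.

```python
def residues_mod(grammar, modulus):
    """Possible residues (mod `modulus`) of the values each variable describes.

    Productions are ("const", n) or ("add", var, n), meaning `var + n`.
    The iteration terminates because there are only `modulus` residues.
    """
    res = {var: set() for var in grammar}
    changed = True
    while changed:
        changed = False
        for var, productions in grammar.items():
            for prod in productions:
                if prod[0] == "const":
                    new = {prod[1] % modulus}
                else:
                    _, other, n = prod
                    new = {(r + n) % modulus for r in res[other]}
                if not new <= res[var]:
                    res[var] |= new
                    changed = True
    return res

# Hypothetical descriptions of two array-size variables.
grammar = {
    "S": [("const", 8), ("add", "S", 4)],   # 8, 12, 16, ...
    "T": [("const", 2), ("add", "T", 4)],   # 2, 6, 10, ...
}
res = residues_mod(grammar, 4)
aligned = {var: r == {0} for var, r in res.items()}
print(aligned)   # {'S': True, 'T': False}
```

Only the post-processing changes here; the grammar produced by the analysis is consumed as-is.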


Future  Work 


Much work remains in the use of the analysis during compilation. The results relating to arrays
reported in Section 4 are very preliminary. Further applications involving data representation,
binding time analysis and flat types [13] are being investigated. The approach of computing
descriptions of how a value is obtained also seems relevant to other kinds of complex values
such as strings, sets and Lisp futures.

¹⁰ In some sense, the approach outlined in this paper can be viewed as a technique for decomposing the large problem of reasoning
about the arithmetic operations of an entire program (which contains other aspects such as higher order functions, continuations and
data-structures) into a number of smaller "local" arithmetic problems. Since these problems are typically much smaller than the
original problem, and are often important for optimization purposes, we may wish to use more expensive analysis procedures on
these subproblems than we could justify for analysis of the entire program.
Appendix I: Interpretation of Set Expressions

1. I(v) = {v};

2. I(se1 arithop se2) = {v1 arithop v2 : v1 ∈ I(se1) ∧ v2 ∈ I(se2)};

3. I(c(se1, ..., sen)) = {c(v1, ..., vn) : vi ∈ I(sei), i = 1..n};

4. I(λx.e) = {λx.e};

5. I(ifnonempty(se1, se2)) = if I(se1) = {} then {} else I(se2);

6. I(apply(se1, se2)) = {v : λx.e ∈ I(se1) ∧ I(se2) ≠ {} ∧ v ∈ I(ran(λx.e))}

   provided λx.e ∈ I(se1) implies I(se2) ⊆ I(X_x)

7. I(case(se1, c(X1, ..., Xn) => se2, Y => se3)) = S1 ∪ S2,

   where (i) S1 = {v : v ∈ I(se2) ∧ ∃ c(v1, ..., vn) ∈ I(se1)}

   (ii) S2 = {v : v ∈ I(se3) ∧ ∃ v' ∈ I(se1) s.t. v' ≠ c(v1, ..., vn)}

   (iii) c(v1, ..., vn) ∈ I(se1) implies vi ∈ I(Xi), i = 1..n

   (iv) v ∈ I(se1) ∧ v ≠ c(v1, ..., vn) implies v ∈ I(Y)


Note that the above interpretation of set expressions is somewhat unusual, because in parts 6 and
7 of the definition, the set expressions themselves impose restrictions on I. If these conditions
are not met, then the interpretation of the expression is undefined.
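
The unconditional parts of this definition (values, arithmetic, constructors and ifnonempty) can be sketched as a small evaluator over finite sets. The tuple encoding of set expressions, the `env` mapping from set variables to sets, and the function name are assumptions made for illustration; the apply and case forms, which carry side conditions, are left out.

```python
import operator
from itertools import product

ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interp(se, env):
    """Interpret a set expression as a finite set of values."""
    tag = se[0]
    if tag == "var":                      # set variable, looked up in env
        return set(env[se[1]])
    if tag == "val":                      # a value denotes the singleton set
        return {se[1]}
    if tag == "arith":                    # all pairwise combinations
        _, op, left, right = se
        f = ARITH[op]
        return {f(a, b) for a in interp(left, env) for b in interp(right, env)}
    if tag == "con":                      # c(se1, ..., sen)
        _, name, args = se
        return {(name, *vs) for vs in product(*(interp(a, env) for a in args))}
    if tag == "ifnonempty":               # empty guard gives the empty set
        _, guard, body = se
        return interp(body, env) if interp(guard, env) else set()
    raise ValueError(f"unsupported set expression: {tag}")

env = {"X": {1, 2}, "Y": {10, 20}, "E": set()}
sums = interp(("arith", "+", ("var", "X"), ("var", "Y")), env)
guarded = interp(("ifnonempty", ("var", "E"), ("val", 5)), env)
pairs = interp(("con", "pair", [("var", "X"), ("val", 0)]), env)
print(sorted(sums))   # [11, 12, 21, 22]
```

The arithmetic case combines every value of the left operand with every value of the right operand, which is exactly the sense in which dependencies between variables are ignored.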
Appendix II: Construction of Set Constraints


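(VAR)
    Z ⊢ x ▷ (X_x, {})

(INT)
    Z ⊢ i ▷ (i, {})

(ABS)
    X_x ⊢ e ▷ (X, C)
    ------------------------------------------------------------
    Z ⊢ λx.e ▷ (Y, {Y ⊇ λx.e, ran(λx.e) ⊇ X} ∪ C)

(APP)
    Z ⊢ e1 ▷ (X1, C1)    Z ⊢ e2 ▷ (X2, C2)
    ------------------------------------------------------------
    Z ⊢ e1 e2 ▷ (Y, {Y ⊇ apply(Y', X2), Y' ⊇ ifnonempty(Z, X1)} ∪ C1 ∪ C2)

(CONST)
    Z ⊢ ei ▷ (Xi, Ci), i = 1..n
    ------------------------------------------------------------
    Z ⊢ c(e1, ..., en) ▷ (Y, {Y ⊇ c(X1, ..., Xn)} ∪ C1 ∪ ... ∪ Cn)

(ARITHOP)
    Z ⊢ e1 ▷ (X1, C1)    Z ⊢ e2 ▷ (X2, C2)
    ------------------------------------------------------------
    Z ⊢ e1 arithop e2 ▷ (Y, {Y ⊇ X1 arithop X2} ∪ C1 ∪ C2)

(CASE)
    Z ⊢ e1 ▷ (X1, C1)    Z ⊢ e2 ▷ (X2, C2)    Z ⊢ e3 ▷ (X3, C3)
    ------------------------------------------------------------
    Z ⊢ case(e1, c(x1, ..., xn) => e2, y => e3) ▷ (Y, C ∪ C1 ∪ C2 ∪ C3)

    where C = {Y ⊇ case(Y', c(X_x1, ..., X_xn) => X2, X_y => X3), Y' ⊇ ifnonempty(Z, X1)}

(IF)
    Z ⊢ e1 ▷ (X1, C1)    Z ⊢ e2 ▷ (X2, C2)
    ------------------------------------------------------------
    Z ⊢ if x1 relop x2 then e1 else e2 ▷ (Y, {Y ⊇ X1 ∪ X2} ∪ C1 ∪ C2)

(FIX)
    Z ⊢ e ▷ (X, C)
    ------------------------------------------------------------
    Z ⊢ fix x.e ▷ (X_x, {X_x ⊇ ifnonempty(Z, X)} ∪ C)


Figure 1: Construction of Set Constraints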

Figure 1 presents the complete details of the construction of set constraints for a term. The
judgement Z ⊢ e ▷ (se, C) recursively passes down a set variable Z which is empty if the
expression under consideration is never called, and is non-empty otherwise. A key property of
the judgement Z ⊢ e ▷ (se, C) is that if Z is empty then C is vacuously true. We now define
SC(e0) as follows: if Z is a new set variable and Z ⊢ e0 ▷ (X, C), then SC(e0) is the pair
(X, {Z ⊇ u} ∪ C) where u is some arbitrary sc-value. Note that all sc-values are set expressions
and that the choice of u is arbitrary; its only purpose is to force the variable Z to be nonempty,
since otherwise the constraints C would be vacuously true.
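
The recursive shape of the judgement and of SC can be sketched for a fragment of the language. The sketch covers only variables, integer literals, arithmetic and fix; it assumes an integer literal yields itself as its set expression with no constraints, represents a constraint "lhs ⊇ rhs" as a pair of strings, and picks "0" as the arbitrary sc-value u. All names are hypothetical, and the remaining rules of Figure 1 are omitted.

```python
from itertools import count

def construct(z, e, fresh):
    """Return (set expression, constraints) for e under the guard variable z.

    Each constraint is a pair (lhs, rhs) read as "lhs ⊇ rhs".
    """
    tag = e[0]
    if tag == "var":                       # (VAR)
        return f"X_{e[1]}", []
    if tag == "int":                       # (INT), assumed form
        return str(e[1]), []
    if tag == "arith":                     # (ARITHOP)
        _, op, e1, e2 = e
        x1, c1 = construct(z, e1, fresh)
        x2, c2 = construct(z, e2, fresh)
        y = f"Y{next(fresh)}"              # fresh set variable for the result
        return y, [(y, f"{x1} {op} {x2}")] + c1 + c2
    if tag == "fix":                       # (FIX)
        _, x, body = e
        xb, c = construct(z, body, fresh)
        xx = f"X_{x}"
        return xx, [(xx, f"ifnonempty({z}, {xb})")] + c
    raise ValueError(f"rule not covered by this sketch: {tag}")

def sc(e0, u="0"):
    """SC(e0): add Z ⊇ u so that the guard variable Z is nonempty."""
    se, constraints = construct("Z", e0, count(1))
    return se, [("Z", u)] + constraints

result = sc(("arith", "+", ("var", "x"), ("int", 1)))
print(result)   # ('Y1', [('Z', '0'), ('Y1', 'X_x + 1')])
```

For the expression x + 1 the sketch yields the set variable Y1 together with the constraints Z ⊇ 0 and Y1 ⊇ X_x + 1.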




School  of  Computer  Science 
Carnegie  Mellon  University 
Pittsburgh,  PA  15213-3890 




